Red Wine Quality analysis by Lazzat Sultanbek

This report explores a data set on the chemical properties of red wine.

Univariate Plots Section

## Observations: 1,599
## Variables: 13
## $ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The data set has 1,599 observations and 13 columns. X is only the row number of each sample, not a property of the wine, which leaves 12 variables. Quality is the output variable that we will be exploring.

Let's drop the unused variable X.

I am also creating a new variable called rating, which splits the wines into 3 categories: bad, good and excellent.

Prior to proceeding to the plots we cleaned up the data by 1) transforming quality
from an integer to an ordered factor, 2) dropping variable X, which is just a numbering of the data, and 3) creating a new variable called rating (bad, good, excellent).
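
The three cleanup steps can be sketched in base R as follows. This is a minimal sketch rather than the report's actual code: the data frame name `wine` is an assumption, the toy rows reuse the first eight X, alcohol and quality values from the listing above, and the cut points (3-4 bad, 5-6 good, 7-8 excellent) are read off the per-rating summaries later in this section.

```r
# Toy rows: the first eight X, alcohol and quality values shown in the listing above.
wine <- data.frame(
  X       = 1:8,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L)
)

# Step 2: drop the row-number column X.
wine$X <- NULL

# Step 3: bin the integer score into rating (3-4 bad, 5-6 good, 7-8 excellent).
wine$rating <- cut(wine$quality,
                   breaks = c(0, 4, 6, 10),
                   labels = c("bad", "good", "excellent"),
                   ordered_result = TRUE)

# Step 1: convert quality to an ordered factor (done last, because cut() needs numbers).
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

table(wine$rating)
```

The rating is derived before quality is converted only because `cut()` works on the numeric score; the order of the steps is otherwise arbitrary.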

The majority of red wines in the sample got a quality score of 5 or 6. There are no observations where a red wine scored below 3 or above 8.

When we plot the data by rating we can see that the majority of wines fall into the “good” category with a quality score of 5-6, followed by the excellent category (scores of 7 and higher). Only a small proportion of wines fall into the “bad” category.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Most red wines have fixed acidity between 7 and 9. There are some outliers which range up to 15.90. The median fixed acidity is 7.90.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The middle 50% of red wines have volatile acidity between 0.39 and 0.64. There are some outliers which range up to 1.58.

We see that the data is skewed to the right, so let's remove the outliers.
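
The report does not say which rule was used to remove outliers. One common choice, sketched here as an assumption, is to drop values above the 99th percentile. The vector holds the first six volatile.acidity values from the listing above plus the maximum (1.58) from the summary; the variable names are illustrative.

```r
# First six volatile.acidity values from the listing, plus the reported maximum.
volatile_acidity <- c(0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 1.580)

# Assumed rule: keep everything at or below the 99th percentile.
cutoff  <- quantile(volatile_acidity, probs = 0.99)
trimmed <- volatile_acidity[volatile_acidity <= cutoff]

summary(trimmed)
```

On the full data set the same rule would discard roughly the top 1% of observations.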

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid values range from 0 to 1, although 75% of the sample falls at or below 0.42.

Residual sugar values tend to vary significantly throughout the sample. It would be interesting to see if there is any correlation with the quality of wine. Residual sugar values range from 0.9 to 15.5; however, zooming in we see that most red wines have residual sugar of 1.5 to 2.5.

Another chemical property skewed to the right is chlorides. Eliminating the outliers and zooming in, we can see that most red wines have chloride values of 0.05 to 0.09.

Similar to chlorides and residual sugar, free sulfur dioxide levels are also skewed to the right. There seems to be a pattern here and it would be interesting to later analyze if there is a correlation between these 3 variables. Are the same samples appearing as outliers in plots for all three (chlorides, residual sugar, free sulfur dioxide)?
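
One way to answer that question is to flag high outliers in each variable and count, per sample, how many of the three flags are set. The sketch below uses the 1.5 x IQR rule, which is an assumption (the report does not define an outlier), and a small synthetic data frame whose values are made up purely to exercise the code.

```r
# Synthetic values, NOT taken from the wine data: rows 9 and 10 are planted outliers.
toy <- data.frame(
  chlorides           = c(rep(0.08, 9), 0.60),
  residual.sugar      = c(rep(2.0, 9), 15.0),
  free.sulfur.dioxide = c(rep(14, 8), 70, 14)
)

# Assumed outlier rule: above the third quartile plus 1.5 times the IQR.
is_high_outlier <- function(x) x > quantile(x, 0.75) + 1.5 * IQR(x)

flags   <- sapply(toy, is_high_outlier)  # logical matrix, one column per variable
n_flags <- rowSums(flags)                # number of variables flagging each sample

table(n_flags)
which(n_flags == ncol(flags))            # samples that are outliers in all three
```

In this toy frame no sample is flagged in all three variables; run on the real data, the last line would list the samples that are.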

Zooming into free sulfur dioxide we see that the majority of wines range from 3 to 40. There is a spike at values 5 and 6, which will be interesting to explore later.

Most red wines in the sample have total sulfur dioxide that ranges from 10 to 50.

Density and pH values look normally distributed, with most wines having a density from 0.995 to 1 and a pH from 3.1 to 3.5.

Sulphates are skewed to the right with most wines ranging between 0.5 and 0.7.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol level of wines in the sample has a mean of 10.42 and a median of 10.20. About 75% of wines have an alcohol level of 11.10 or lower.

The three summaries below cover, in order, the wines rated excellent (217), bad (63) and good (1,319).

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.900   Min.   :0.1200   Min.   :0.0000   Min.   :1.200  
##  1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000   1st Qu.:2.000  
##  Median : 8.700   Median :0.3700   Median :0.4000   Median :2.300  
##  Mean   : 8.847   Mean   :0.4055   Mean   :0.3765   Mean   :2.709  
##  3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900   3rd Qu.:2.700  
##  Max.   :15.600   Max.   :0.9150   Max.   :0.7600   Max.   :8.900  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 3.00       Min.   :  7.00      
##  1st Qu.:0.06200   1st Qu.: 6.00       1st Qu.: 17.00      
##  Median :0.07300   Median :11.00       Median : 27.00      
##  Mean   :0.07591   Mean   :13.98       Mean   : 34.89      
##  3rd Qu.:0.08500   3rd Qu.:18.00       3rd Qu.: 43.00      
##  Max.   :0.35800   Max.   :54.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9906   Min.   :2.880   Min.   :0.3900   Min.   : 9.20   3:  0  
##  1st Qu.:0.9947   1st Qu.:3.200   1st Qu.:0.6500   1st Qu.:10.80   4:  0  
##  Median :0.9957   Median :3.270   Median :0.7400   Median :11.60   5:  0  
##  Mean   :0.9960   Mean   :3.289   Mean   :0.7435   Mean   :11.52   6:  0  
##  3rd Qu.:0.9973   3rd Qu.:3.380   3rd Qu.:0.8200   3rd Qu.:12.20   7:199  
##  Max.   :1.0032   Max.   :3.780   Max.   :1.3600   Max.   :14.00   8: 18  
##        rating   
##  bad      :  0  
##  good     :  0  
##  excellent:217  
##                 
##                 
## 
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.600   Min.   :0.2300   Min.   :0.0000   Min.   : 1.200  
##  1st Qu.: 6.800   1st Qu.:0.5650   1st Qu.:0.0200   1st Qu.: 1.900  
##  Median : 7.500   Median :0.6800   Median :0.0800   Median : 2.100  
##  Mean   : 7.871   Mean   :0.7242   Mean   :0.1737   Mean   : 2.685  
##  3rd Qu.: 8.400   3rd Qu.:0.8825   3rd Qu.:0.2700   3rd Qu.: 2.950  
##  Max.   :12.500   Max.   :1.5800   Max.   :1.0000   Max.   :12.900  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.04500   Min.   : 3.00       Min.   :  7.00      
##  1st Qu.:0.06850   1st Qu.: 5.00       1st Qu.: 13.50      
##  Median :0.08000   Median : 9.00       Median : 26.00      
##  Mean   :0.09573   Mean   :12.06       Mean   : 34.44      
##  3rd Qu.:0.09450   3rd Qu.:15.50       3rd Qu.: 48.00      
##  Max.   :0.61000   Max.   :41.00       Max.   :119.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9934   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3:10   
##  1st Qu.:0.9957   1st Qu.:3.300   1st Qu.:0.4950   1st Qu.: 9.60   4:53   
##  Median :0.9966   Median :3.380   Median :0.5600   Median :10.00   5: 0   
##  Mean   :0.9967   Mean   :3.384   Mean   :0.5922   Mean   :10.22   6: 0   
##  3rd Qu.:0.9977   3rd Qu.:3.500   3rd Qu.:0.6000   3rd Qu.:11.00   7: 0   
##  Max.   :1.0010   Max.   :3.900   Max.   :2.0000   Max.   :13.10   8: 0   
##        rating  
##  bad      :63  
##  good     : 0  
##  excellent: 0  
##                
##                
## 
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.700   Min.   :0.1600   Min.   :0.0000   Min.   : 0.900  
##  1st Qu.: 7.100   1st Qu.:0.4100   1st Qu.:0.0900   1st Qu.: 1.900  
##  Median : 7.800   Median :0.5400   Median :0.2400   Median : 2.200  
##  Mean   : 8.254   Mean   :0.5386   Mean   :0.2583   Mean   : 2.504  
##  3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4000   3rd Qu.: 2.600  
##  Max.   :15.900   Max.   :1.3300   Max.   :0.7900   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.03400   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07100   1st Qu.: 8.00       1st Qu.: 24.00      
##  Median :0.08000   Median :14.00       Median : 40.00      
##  Mean   :0.08897   Mean   :16.37       Mean   : 48.95      
##  3rd Qu.:0.09100   3rd Qu.:22.00       3rd Qu.: 65.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :165.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.860   Min.   :0.3700   Min.   : 8.40   3:  0  
##  1st Qu.:0.9958   1st Qu.:3.210   1st Qu.:0.5400   1st Qu.: 9.50   4:  0  
##  Median :0.9968   Median :3.310   Median :0.6100   Median :10.00   5:681  
##  Mean   :0.9969   Mean   :3.311   Mean   :0.6473   Mean   :10.25   6:638  
##  3rd Qu.:0.9979   3rd Qu.:3.400   3rd Qu.:0.7000   3rd Qu.:10.90   7:  0  
##  Max.   :1.0037   Max.   :4.010   Max.   :1.9800   Max.   :14.90   8:  0  
##        rating    
##  bad      :   0  
##  good     :1319  
##  excellent:   0  
##                  
##                  
## 

Univariate Analysis

What is the structure of your dataset?

There are 1,599 observations in the dataset with 12 variables. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). Variables in the dataset are:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.
5 - chlorides: the amount of salt in the wine.
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content.
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
11 - alcohol: the percent alcohol content of the wine.

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is the quality of wine, together with the other variables which, directly or in combination with other characteristics, impact the quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Comparing the properties of bad, good and excellent wines, volatile.acidity and citric.acid differ noticeably between ratings, which hints that these are the properties impacting the quality of wine. Another characteristic that would be interesting to explore is the level of alcohol and its impact on the quality of wine.
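
The comparison can be made concrete with the medians printed in the three per-rating summaries above. The sketch below only re-enters those printed medians; the object names `medians` and `steps` are illustrative.

```r
# Medians copied from the per-rating summary tables above.
medians <- data.frame(
  rating           = c("bad", "good", "excellent"),
  volatile.acidity = c(0.68, 0.54, 0.37),
  citric.acid      = c(0.08, 0.24, 0.40),
  alcohol          = c(10.0, 10.0, 11.6)
)

# Change in each median when moving up one rating (bad -> good -> excellent).
steps <- sapply(medians[, -1], diff)
rownames(steps) <- c("bad->good", "good->excellent")
steps
```

Median volatile.acidity falls and median citric.acid rises at each step, while median alcohol only differs for the excellent group.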

Did you create any new variables from existing variables in the dataset?

I created a rating variable to group the population into 3 categories: bad, good and excellent.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Chlorides, residual sugar and free sulfur dioxide levels are all skewed to the right. There seems to be a pattern here and it would be interesting to later analyze if there is a correlation between these 3 variables. Are the same samples appearing as outliers in plots for all three (chlorides, residual sugar, free sulfur dioxide)?

I deleted the variable X from the dataset, which was just a numbering of the samples and not a characteristic of the wine. I also transformed quality from an integer to an ordered factor.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

Before I start the bivariate plot analysis I would like to run ggpairs to see the relationships between the different variables.

Let's first explore which properties of wine correlate with quality. From the above correlation matrix, we see that quality has its highest positive correlation with alcohol (0.48) and its strongest negative correlation with volatile acidity (-0.39).
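
To make the ranking explicit, the quality column of the correlation matrix above can be sorted by absolute value. The numbers below are copied from that matrix; the names `quality_cor` and `ranked` are illustrative.

```r
# Correlation of each variable with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity = 0.12, volatile.acidity = -0.39, citric.acid = 0.23,
  residual.sugar = 0.01, chlorides = -0.13, free.sulfur.dioxide = -0.05,
  total.sulfur.dioxide = -0.19, density = -0.17, pH = -0.06,
  sulphates = 0.25, alcohol = 0.48
)

# Strongest relationships first, regardless of sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
head(ranked, 4)
```

Alcohol and volatile.acidity lead, followed by sulphates (0.25) and citric.acid (0.23).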

The box plot of quality vs alcohol is quite interesting: it appears that higher quality wines have a higher percent of alcohol content.
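
The numbers behind such a box plot can be inspected with a per-score median. This is a minimal sketch that uses only the first eight alcohol and quality values from the listing at the top of the report, so it illustrates the call rather than the full data set; the object names are assumptions.

```r
# First eight alcohol and quality values as shown in the listing.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7), ordered = TRUE)
)

# Median alcohol for each quality score present in these rows.
alcohol_by_quality <- tapply(wine_head$alcohol, wine_head$quality, median)
alcohol_by_quality

# The same grouping drives the box plot itself:
# boxplot(alcohol ~ quality, data = wine_head)
```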

The box plot of quality against volatile acidity shows that lower quality wines normally have a higher level of volatile acidity and, vice versa, higher quality wines have a lower level of volatile acidity. This makes sense because volatile acidity is the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.

There is a slight correlation between quality and citric.acid (0.23), and it appears that higher quality wines have a slightly higher level of citric acid. This also makes sense given that citric acid can add ‘freshness’ and flavor to wines.

pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. The above boxplot shows that wines in our sample are mainly within the 3-4 range. In general, better quality wines tend to have a lower pH, with some outliers of course, although the correlation with quality is weak (-0.06).

Next let's explore which chemical properties alcohol and volatile.acidity are correlated with.

Alcohol has its strongest correlation with density (-0.50), so let's plot that first.

There is a negative correlation between density and alcohol, which makes sense given that the density of wine is close to that of water depending on the percent alcohol and sugar content: alcohol is less dense than water, so more alcohol lowers the density.

Per the correlation matrix, volatile.acidity has its strongest correlation with citric.acid (-0.55), so let's plot it to see what that relationship looks like.

There is a moderately negative relationship between volatile.acidity and citric.acid. Higher levels of citric.acid are associated with lower levels of volatile.acidity.

Per the above, the higher the level of fixed.acidity, the higher the density.

Citric.acid is positively correlated with density. Higher citric.acid goes with higher density.

There is a positive correlation between fixed.acidity and citric.acid levels.

It appears pH and fixed.acidity have a high negative correlation. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. Per the chart above, wines with lower levels of fixed.acidity are less acidic.

Per the chart, higher levels of citric.acid are associated with lower pH, i.e. higher citric.acid means a more acidic wine.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

My main feature of interest in the above analysis was which variables are associated with the quality of wine. I noticed that quality has a positive correlation with alcohol and a negative correlation with volatile acidity.
1. It appears that higher quality wines have a higher percent of alcohol content.
2. The box plot of quality against volatile acidity shows that lower quality wines have a higher level of volatile acidity and higher quality wines have a lower level of volatile acidity. This makes sense because volatile acidity is the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
3. There is a slight correlation between quality and citric.acid, and it appears that higher quality wines have a slightly higher level of citric acid. This also makes sense given that citric acid can add ‘freshness’ and flavor to wines.
4. pH vs quality shows that wines in our sample are mainly within the 3-4 range. In general, better quality wines tend to have a lower pH, with some outliers of course.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

In my analysis I explored relationships for those variables which appeared to have the highest correlation coefficients. There were some more obvious observations, like the negative correlation between density and alcohol and the negative relationship between volatile.acidity and citric.acid.

However, the relationships I found most interesting were between fixed.acidity and other variables. There is a positive correlation between fixed.acidity and citric.acid levels, and citric.acid is known to bring freshness and flavor to wine. Meanwhile, pH and fixed.acidity have a high negative correlation: wines with lower levels of fixed.acidity appear less acidic. Also, the higher the level of fixed.acidity, the higher the density.

What was the strongest relationship you found?

The strongest relationship I found is between fixed.acidity and pH (correlation of -0.68).
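
That claim can be checked mechanically against the correlation matrix printed in the Bivariate Plots Section. The sketch below re-enters a five-variable subset of that matrix (the subset and the object names are chosen here for brevity) and picks the off-diagonal entry with the largest absolute value.

```r
# Subset of the printed correlation matrix (values copied as shown).
vars <- c("fixed.acidity", "citric.acid", "density", "pH", "alcohol")
r <- matrix(c(
   1.00,  0.67,  0.67, -0.68, -0.06,
   0.67,  1.00,  0.36, -0.54,  0.11,
   0.67,  0.36,  1.00, -0.34, -0.50,
  -0.68, -0.54, -0.34,  1.00,  0.21,
  -0.06,  0.11, -0.50,  0.21,  1.00
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Ignore each variable's correlation with itself.
off_diag <- r
diag(off_diag) <- 0

# Position of the largest absolute correlation (first of the two mirrored cells).
idx <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]
strongest <- c(rownames(r)[idx["row"]], colnames(r)[idx["col"]])

strongest
r[idx["row"], idx["col"]]
```

Within this subset the pair is pH and fixed.acidity at -0.68; in the full printed matrix no other pair exceeds that magnitude either (the next largest are 0.67).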

Multivariate Plots Section

This is interesting. We can see that high alcohol content and high citric.acid go with high quality wine, while low alcohol content and low citric.acid get a low quality score. Interestingly, citric.acid appears to impact the quality score more than the alcohol content. We can observe this for some high quality wines which have lower alcohol content but high citric.acid.

Plotting alcohol against volatile.acidity (the two variables most highly correlated with quality), we see the expected tendency: higher alcohol and lower volatile.acidity result in higher quality. However, we see some outliers where high alcohol content combined with higher volatile.acidity results in a lower quality score than lower alcohol content with lower volatile.acidity.

Let's now plot citric.acid vs volatile.acidity, split by quality, and see what happens.

What we see here is that low citric.acid and high volatile.acidity result in a low quality score, which is expected based on our observations above. However, wine with medium volatile.acidity and high citric.acid will still get a low quality score. This shows that although both are important, it is actually more impactful to have lower volatile.acidity than to have high citric.acid.

Per the above chart, we see that the positive correlation between fixed.acidity and citric.acid is consistent across all three categories of wine (bad, good, excellent). However, this relationship is most obvious in the highest scoring category, where fixed.acidity increases along with citric.acid.

The fixed.acidity and density relationship by wine category reveals that higher quality wines appear less dense than the other categories at the same level of fixed.acidity.
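
The medians from the per-rating summaries in the Univariate Plots Section point the same way. In this sketch the values are copied from those summaries and the names are illustrative: excellent wines have the highest median fixed.acidity but the lowest median density, even though the two variables are positively correlated overall (0.67).

```r
# Medians copied from the per-rating summary tables.
by_rating <- data.frame(
  rating        = c("bad", "good", "excellent"),
  fixed.acidity = c(7.5, 7.8, 8.7),
  density       = c(0.9966, 0.9968, 0.9957)
)

# Rating with the highest median fixed.acidity, and rating with the lowest median density.
highest_acidity <- by_rating$rating[which.max(by_rating$fixed.acidity)]
lowest_density  <- by_rating$rating[which.min(by_rating$density)]

c(highest_acidity = highest_acidity, lowest_density = lowest_density)
```

Medians alone do not hold fixed.acidity constant, so this is only a rough consistency check on what the chart shows.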

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I further explored some of the relationships from the previous sections by adding the quality variables and seeing whether each relationship holds true across all categories of wine.

  1. The fixed.acidity and citric.acid relationship is consistent across all three categories of wine (bad, good, excellent).

  2. Alcohol against volatile.acidity shows the expected tendency: higher alcohol and lower volatile.acidity result in higher quality.

  3. As previously noted, high alcohol content and high citric.acid go with the highest quality wine, while low alcohol content and low citric.acid get the lowest quality score.

  4. The fixed.acidity and density relationship by wine category reveals that higher quality wines appear less dense than the other categories at the same level of fixed.acidity.

Were there any interesting or surprising interactions between features?

Some interesting observations are:

1. Low citric.acid and high volatile.acidity result in the lowest quality score, which is expected based on previous observations. However, wine with high volatile.acidity and high citric.acid will still get a low quality score (4). This shows that although both are important, it is actually more impactful to have lower volatile.acidity than to have high citric.acid.

2. In the citric.acid vs alcohol chart we notice that citric.acid impacts the quality score more than the alcohol content. We can see this for quality score 5, which has higher alcohol content but less citric.acid, vs quality score 7, which has lower alcohol content but higher citric.acid.

3. In the alcohol vs volatile.acidity chart, some wines with quality score 5 combine high alcohol content with higher volatile.acidity, yet they score lower than wines with lower alcohol content and lower volatile.acidity. This suggests that volatile.acidity impacts the quality score more than the alcohol content.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

N/A

Final Plots and Summary

One of the highest correlations we have noted in the dataset is alcohol vs quality.

Description One

It appears that higher quality wines have a higher percent of alcohol content.

Plot Two

Description Two

The bivariate analysis performed above also showed that there is a positive correlation between quality and citric acid (0.23) and a negative correlation between quality and volatile acidity (-0.39). Here I am plotting citric acid against volatile acidity by quality score. We can observe that low citric.acid and high volatile.acidity result in the lowest quality score (3). However, wine with high volatile.acidity and high citric.acid will still get a low quality score (4). This shows that although both are important, it is actually more impactful to have lower volatile.acidity than to have high citric.acid.

Plot Three

Description Three

Our bivariate analysis shows that there is a negative correlation between density and alcohol, which makes sense given that the density of wine is close to that of water depending on the percent alcohol and sugar content. We also observe that there is a positive correlation between fixed.acidity and citric.acid levels.

Therefore, I also want to see how the relationship between fixed.acidity and density varies by wine category. The fixed.acidity and density relationship by wine category reveals that higher quality wines appear less dense than the other categories at the same level of fixed.acidity.
——

Reflection

There are 1,599 observations in the dataset with 12 variables. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

In the univariate analysis I noted that the majority of wines in our dataset have a quality score of 5 or 6.

Comparing the properties of bad, good and excellent wines, volatile.acidity and citric.acid differ noticeably between ratings, which hints that these are the properties impacting the quality of wine. Another characteristic that would be interesting to explore is the level of alcohol and its impact on the quality of wine.

Further, I explored bivariate relationships between different chemical components. In my analysis I explored relationships for those variables which appeared to have the highest correlation coefficients. There were some more obvious observations, like the negative correlation between density and alcohol and the negative relationship between volatile.acidity and citric.acid.

There is a positive correlation between fixed.acidity and citric.acid levels, and citric.acid is known to bring freshness and flavor to wine. Meanwhile, pH and fixed.acidity have a high negative correlation: wines with lower levels of fixed.acidity appear less acidic. Also, the higher the level of fixed.acidity, the higher the density.

In the multivariate analysis we observed that the fixed.acidity and citric.acid relationship is consistent across all three categories of wine (bad, good, excellent).

Also, higher alcohol and lower volatile.acidity result in higher quality. High alcohol content and high citric.acid go with the highest quality wine, while low alcohol content and low citric.acid get the lowest quality score. The fixed.acidity and density relationship by wine category reveals that higher quality wines appear less dense than the other categories at the same level of fixed.acidity.

You can find these and other observations described in the above sections.

What would be interesting to add to the dataset in the future is the region each wine comes from.